perm filename V242.TEX[TEX,DEK]1 blob
sn#362433 filedate 1978-06-22 generic text, type C, neo UTF8
COMMENT ⊗ VALID 00009 PAGES
C REC PAGE DESCRIPTION
C00001 00001
C00002 00002 \input acphdr % Section 4.2
C00003 00003 %folio 259 galley 7b (C) Addison-Wesley 1978 *
C00011 00004 %folio 262 galley 1 (C) Addison-Wesley 1978 *
C00021 00005 %folio 265 galley 2 (C) Addison-Wesley 1978 *
C00029 00006 %folio 269 galley 3 (C) Addison-Wesley 1978 *
C00037 00007 %folio 273 galley 4 (C) Addison-Wesley 1978 *
C00052 00008 %folio 276 galley 5 (C) Addison-Wesley 1978 *
C00065 00009 %folio 282 galley 6 (C) Addison-Wesley 1978 *
C00077 ENDMK
C⊗;
\input acphdr % Section 4.2
\runninglefthead{ARITHMETIC} % chapter title
\titlepage\setcpage0
\vfill
\tenpoint
\ctrline{SECTION 4.2 of THE ART OF COMPUTER PROGRAMMING}
\ctrline{$\copyright$ 1978 Addison--Wesley Publishing Company, Inc.}
\vfill
\runningrighthead{FLOATING-POINT ARITHMETIC}
section{4.2}
\eject
\setcpage 196
%folio 259 galley 7b (C) Addison-Wesley 1978 *
\sectionbegin{4.2. FLOATING-POINT ARITHMETIC}
I{\:cN THIS SECTION}, we shall study the basic principles
of doing arithmetic on ``floating-point'' numbers, by analyzing
the internal mechanism of such calculations. Perhaps many readers
will have little interest in this subject, since their computers
either have built-in floating-point instructions or their computer
manufacturer has supplied suitable subroutines. But, in fact,
the material of this section should not merely be the concern
of computer-design engineers or of a small clique of people
who write library subroutines for new machines; {\sl every}
well-rounded programmer ought to have a knowledge of what goes
on during the elementary steps of floating-point arithmetic.
This subject is not at all as trivial as most people think;
it involves a surprising amount of interesting information.
\runningrighthead{SINGLE-PRECISION CALCULATIONS}
section{4.2.1}
\sectionskip
\sectionbegin{4.2.1. Single-Precision Calculations}
\def\\{\raise 3.444pt\hjust{$\scriptscriptstyle\!\!\not\,\,$}} % \\h will be h bar
{\bf A. Floating-point notation.}\xskip We have
discussed ``fixed-point'' notation for numbers in Section 4.1;
in such a case the programmer knows where the radix point is
assumed to lie in the numbers he manipulates. For many purposes
it is considerably more convenient to let the position of the
radix point be dynamically variable or ``floating'' as a program
is running, and to carry with each number an indication of its
current radix point position. This idea has been used
for many years in scientific calculations, especially for expressing
very large numbers like Avogadro's number $N = 6.02252 \times
10↑{23}$, or very small numbers like Planck's constant $\\h = 1.0545
\times 10↑{-27}$ erg sec.
In this section we shall work with {\sl base $b$, excess $q$, floating-point
numbers with $p$ digits}\/: Such numbers will be represented by pairs of values
$(e, f)$, denoting
$$(e, f) = f \times b↑{e-q}.\eqno (1)$$
Here $e$ is an integer having a specified range,
and $f$ is a signed fraction. We will adopt the convention that
$$|f| < 1;$$
in other words, the radix point appears at the
left of the positional representation of $f$. More precisely,
the stipulation that we have $p$-digit numbers means that $b↑pf$
is an integer, and that
$$-b↑p < b↑pf < b↑p.\eqno (2)$$
The term ``floating binary'' implies that $b =
2$, ``floating decimal'' implies $b = 10$, etc. Using excess
50 floating-decimal numbers with 8 digits, we can write, for
example,
$$\vcenter{\halign{\lft{#}\quad⊗\rt{$#$}⊗\lft{$\;#$}\cr
Avogadro's number⊗N⊗ = (74, +.60225200);\cr
\noalign{\vskip 2pt}
Planck's constant⊗\\h⊗= (24, +.10545000).\cr}}\eqno(3)$$
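For concreteness, the correspondence (1) can be spelled out in a few lines of
Python (a rough sketch; the helper name and the default parameters are merely
illustrative), applied to the two pairs in (3):

    # Rough sketch: the real value denoted by a base-b, excess-q pair (e, f).
    def value(e, f, b=10, q=50):
        return f * b**(e - q)

    print(value(74, +.60225200))   # Avogadro's number, approximately 6.02252e+23
    print(value(24, +.10545000))   # Planck's constant, approximately 1.0545e-27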
The two components $e$ and $f$ of a floating-point
number are called the {\sl exponent} and the {\sl fraction}
parts, respectively. (Other names are occasionally used for
this purpose, notably ``characteristic'' and ``mantissa''; but
it is an abuse of terminology to call the fraction part a mantissa,
since this concept has quite a different meaning in connection
with logarithms. Furthermore the English word mantissa means
``a worthless addition.'')
The \MIX\ computer assumes that its floating-point numbers
have the form
$$\def\\{\vrule height 12.4pt depth 7.4pt}
\vcenter{\hjust to 124pt{\hfill\vjust{\baselineskip0pt\lineskip0pt
\hrule
\hjust{\\ \hjust to 20pt{\hfill$\pm$\hfill\\}\!
\hjust to 20pt{\hfill$e$\hfill\\}\!
\hjust to 20pt{\hfill$f$\hfill\\}\!
\hjust to 20pt{\hfill$f$\hfill\\}\!
\hjust to 20pt{\hfill$f$\hfill\\}\!
\hjust to 20pt{\hfill$f$\hfill\\}}
\hrule}\hfill}}.\eqno(4)$$
Here we have base $b$, excess $q$, floating-point
notation with four bytes of precision, where $b$ is the
byte size (e.g., $b = 64$ or $b = 100)$, and $q$ is equal to
$\lfloor {1\over 2}b\rfloor $. The fraction part is $\pm\,f\,f\,f\,f$,
and $e$ is the exponent, which lies in the range $0 ≤
e < b$. This internal representation is typical of the conventions
in most existing computers, although $b$ is a much larger base
than usual.
\subsectionbegin{B. Normalized calculations} A floating-point
number $(e, f)$ is said to be {\sl normalized} if the most significant
digit of the representation of $f$ is nonzero, so that
$$1/b ≤ |f| < 1;\eqno (5)$$
or if $f = 0$ and $e$ has its smallest possible
value. It is possible to tell which of two normalized floating-point
numbers has a greater magnitude by comparing the exponent parts
first, and then testing the fraction parts only if the exponents
are equal.
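This comparison rule can be stated in a line or two of code; the rough Python
sketch below (illustrative names, nonnegative operands only, so that signs need
not be examined) simply compares the pairs lexicographically.

    # For normalized, nonnegative pairs (e, f) we have 1/b <= f < 1 unless f = 0
    # with the smallest exponent, so lexicographic order on (e, f) agrees with
    # order by value and the fractions never need to be unpacked.
    def less_in_magnitude(u, v):
        (eu, fu), (ev, fv) = u, v
        return (eu, fu) < (ev, fv)

    print(less_in_magnitude((24, .10545000), (74, .60225200)))   # True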
%folio 262 galley 1 (C) Addison-Wesley 1978 *
Most floating-point routines now in use deal almost
entirely with normalized numbers: inputs to the routines are
assumed to be normalized, and the outputs are always normalized.
Under these conventions we lose the ability to represent a few
numbers of very small magnitude---for example, the value $(0,
.00000001)$ can't be normalized without producing a negative
exponent---but we gain in speed, uniformity, and the ability
to give relatively simple bounds on the relative error in our
computations. (Unnormalized floating-point arithmetic is discussed
in Section 4.2.2.)
Let us now study the normalized floating-point
operations in detail. At the same time we can consider the construction
of subroutines for these operations, assuming that we have
a computer without built-in floating-point hardware.
Machine-language subroutines for floating-point
arithmetic are usually written in a very machine-dependent manner,
using many of the wildest idiosyncrasies of the computer at
hand; so floating-point addition subroutines
for two different machines usually bear little superficial resemblance
to each other. Yet a careful study of numerous subroutines
for both binary and decimal computers reveals that these programs
actually have quite a lot in common, and it is possible to discuss
the topics in a machine-independent way.
\yskip The first (and by far the most difficult!) algorithm we shall
discuss in this section is a procedure for floating-point addition:
$$(e↓u, f↓u) \oplus (e↓v, f↓v) = (e↓w, f↓w).\eqno (6)$$
{\sl Note: Since floating-point arithmetic is
inherently approximate, not exact, we will use ``round'' symbols
$$\oplus ,\quad \ominus ,\quad \otimes ,\quad \odiv$$
to stand for floating-point addition, subtraction,
multiplication, and division, respectively, in order to distinguish
approximate operations from the true ones.}
\yskip The basic idea involved in floating-point
addition is fairly simple: Assuming that $e↓u ≥ e↓v$, we take
$e↓w = e↓u$, $f↓w = f↓u + f↓v/b↑{e↓u-e↓v}$ (thereby aligning the
radix points for a meaningful addition), and normalize the result.
Several situations can arise which make this process nontrivial,
and the following algorithm explains the method more precisely.
\algbegin Algorithm A (Floating-point addition).
Given base $b$, excess $q$, $p$-digit, normalized floating-point
numbers $u = (e↓u, f↓u)$ and $v = (e↓v, f↓v)$, this algorithm
forms the sum $w = u \oplus v$. The same procedure may be used
for floating-point subtraction, if $-v$ is substituted for $v$.
\algstep A1. [Unpack.] Separate the
exponent and fraction parts of the representations
of $u$ and $v$.
\topinsert{\vskip 60mm
\ctrline{\caption Fig. 2. Floating-point addition.}}
\algstep A2. [Assume $e↓u ≥ e↓v$.] If $e↓u < e↓v$, interchange
$u$ and $v$. (In many cases, it is best to combine step A2 with
step A1 or with some of the later steps.)
\algstep A3. [Set $e↓w$.] Set $e↓w ← e↓u$.
\algstep A4. [Test $e↓u - e↓v$.] If $e↓u - e↓v ≥ p + 2$ (large
difference in exponents), set $f↓w ← f↓u$ and go to step A7.
(Actually, since we are assuming that $u$ is normalized, we
could terminate the algorithm; but it is often useful to be
able to normalize a possibly unnormalized number by adding zero
to it.)
\algstep A5. [Scale right.] Shift $f↓v$ to the right $e↓u -
e↓v$ places; i.e., divide it by $b↑{e↓u-e↓v}$. {\sl Note:} This
will be a shift of up to $p + 1$ places, and the next step (which
adds $f↓u$ to $f↓v$) thereby requires an accumulator capable
of holding $2p + 1$ base-$b$ digits to the right of the radix
point. If such a large accumulator is not available, it is possible
to shorten the requirement to $p + 2$ or $p + 3$ places if proper
precautions are taken; the details are given in exercise 5.
\algstep A6. [Add.] Set $f↓w ← f↓u + f↓v$.
\algstep A7. [Normalize.] (At this point $(e↓w, f↓w)$ represents
the sum of $u$ and $v$, but $|f↓w|$ may have more than $p$ digits,
and it may be greater than unity or less than $1/b$.) Perform
Algorithm N below, to normalize and round $(e↓w, f↓w)$ into
the final answer.\quad\blackslug
\algbegin Algorithm N (Normalization).
A ``raw exponent'' $e$ and a ``raw fraction'' $f$ are converted
to normalized form, rounding if necessary to $p$ digits. This
algorithm assumes that $|f| < b$.
\algstep N1. [Test $f$.] If $|f| ≥ 1$ (``fraction
overflow''), go to step N4. If $f = 0$, set $e$ to its lowest
possible value and go to step N7.
\topinsert{\vskip 56mm
\ctrline{\caption Fig. 3. Normalization of $(e,f)$.}}
\algstep N2. [Is $f$ normalized?] If $|f| ≥ 1/b$, go to step
N5.
\algstep N3. [Scale left.] Shift $f$ to the left by one digit
position (i.e., multiply it by $b$), and decrease $e$ by 1.
Return to step N2.
\algstep N4. [Scale right.] Shift $f$ to the right by one digit
position (i.e., divide it by $b$), and increase $e$ by 1.
\algstep N5. [Round.] Round $f$ to $p$ places. (We take this
to mean that $f$ is changed to the nearest multiple of $b↑{-p}$.
It is possible that $(b↑pf)\mod 1 = {1\over 2}$ so that there
are {\sl two} nearest multiples; if $b$ is even, we choose that
one which makes $b↑pf + {1\over 2}b$ odd. Further discussion
of rounding appears in Section 4.2.2.) It is important to note
that this rounding operation can make $|f| = 1$ (``rounding
overflow''); in such a case, return to step N4.
\algstep N6. [Check $e$.] If $e$ is too large, i.e., larger
than its allowed range, an {\sl exponent overflow} condition
is sensed. If $e$ is too small, an {\sl exponent underflow}
condition is sensed. (See the discussion below; since the result
cannot be expressed as a normalized floating-point number in
the required range, special action is necessary.)
\algstep N7. [Pack.] Put $e$ and $f$ together into the desired
output representation.\quad\blackslug
\yyskip Some simple examples of floating-point addition are given in exercise 4.
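For readers who would like a higher-level rendering before turning to the
\MIX\ code, the following rough sketch in Python restates Algorithms A and N
under the illustrative assumptions $b = 100$, $p = 4$, $q = 50$, and exponent
range $0 ≤ e < b$; fractions are carried exactly so that the rounding of step
N5, including its treatment of ties, can be reproduced. All names are merely
illustrative, and the register-level concerns handled by Program A below are
ignored here.

    # Rough sketch of Algorithms A and N.  A floating-point number is a pair
    # (e, f) with f held as an exact Fraction, |f| < 1 and b**p * f an integer.
    from fractions import Fraction

    B, P, Q = 100, 4, 50              # illustrative base, precision, and excess
    EMIN, EMAX = 0, B - 1             # allowed exponent range, as in (4)

    def normalize(e, f):
        """Algorithm N: normalize and round a raw (e, f) with |f| < B."""
        if f == 0:
            return (EMIN, Fraction(0))        # N1: zero gets the smallest exponent
        while abs(f) >= 1:                    # N4: scale right (fraction overflow)
            f /= B; e += 1
        while abs(f) < Fraction(1, B):        # N2, N3: scale left until normalized
            f *= B; e -= 1
        g = abs(f) * B**P                     # N5: round to a multiple of B**-P;
        n, r = divmod(g.numerator, g.denominator)
        if 2*r > g.denominator or (2*r == g.denominator and n % 2 == 0):
            n += 1                            #     a tie makes the last digit odd
        f = Fraction(n, B**P) if f > 0 else -Fraction(n, B**P)
        if abs(f) >= 1:                       # rounding overflow: back to N4
            f /= B; e += 1
        if not EMIN <= e <= EMAX:             # N6: exponent out of range
            raise OverflowError("exponent overflow or underflow")
        return (e, f)                         # N7: pack

    def fadd(u, v):
        """Algorithm A: the sum of normalized floating-point numbers u and v."""
        (eu, fu), (ev, fv) = u, v
        if eu < ev:                           # A2: make sure that eu >= ev
            (eu, fu), (ev, fv) = (ev, fv), (eu, fu)
        if eu - ev >= P + 2:                  # A4: v is too small to change the
            return normalize(eu, fu)          #     rounded result
        return normalize(eu, fu + fv / B**(eu - ev))    # A3, A5, A6, A7

    def fsub(u, v):                           # subtraction: add the negative of v
        return fadd(u, (v[0], -v[1]))

    # Example in base 100:  50 "plus" 50  is  (52, 1/100), i.e., .01 * 100**2.
    print(fadd((Q + 1, Fraction(1, 2)), (Q + 1, Fraction(1, 2))))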
%folio 265 galley 2 (C) Addison-Wesley 1978 *
\yyskip The following \MIX\ subroutines, for addition and subtraction
of numbers having the form (4), show how Algorithms A and N
can be expressed as computer programs. The subroutines below
are designed to take one input $u$ from symbolic location {\tt ACC},
while the other input $v$ comes from register A upon entrance
to the subroutine. The output $w$ appears both in register A
and location {\tt ACC}. Thus, a fixed-point coding sequence
$$\hjust{\tt LDA A;\quad ADD B;\quad SUB C;\quad STA D}\eqno (7)$$
would correspond to the floating-point coding sequence
$$\hjust{\tt LDA A, STA ACC; LDA B, JMP FADD;
LDA C, JMP FSUB; STA D}.\eqno (8)$$
\vskip -6pt
\algbegin Program A (Addition, subtraction,
and normalization). The following program is a subroutine
for Algorithm A, and it is also designed so that the normalization
portion can be used by other subroutines which appear later
in this section. In this program and in many other programs
throughout this chapter, {\tt OFLO} stands for a subroutine which
prints out a message to the effect that \MIX's overflow toggle
was unexpectedly found to be ``on.'' The byte size $b$ is assumed
to be a multiple of 4. The normalization routine {\tt NORM} assumes
that $\rI2 = e$ and $\rAX = f$, where $\rA = 0$ implies $\rX = 0$ and
$\rI2 < b$.
\yyskip
\mixfour{00⊗BYTE⊗EQU⊗1(4:4)⊗Byte size $b$\cr
01⊗EXP⊗EQU⊗1:1⊗Definition of exponent field\cr
02⊗FSUB⊗STA⊗TEMP⊗Floating-point subtraction subroutine:\cr
03⊗⊗LDAN⊗TEMP⊗Change sign of operand.\cr
04⊗FADD⊗STJ⊗EXITF⊗Floating-point addition subroutine:\cr
05⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
06⊗⊗STA⊗TEMP⊗$\hjust{\tt TEMP}← v$.\cr
07⊗⊗LDX⊗ACC⊗$\rX ← u$.\cr
\\
08⊗⊗CMPA⊗ACC(EXP)⊗$\underline{\hjust{Ste}}$p\hskip-3pt$\underline{\hjust{\hskip3pt
s A1}}$,$\underline
{\hjust{ A2}}$,$\underline{\hjust{ A3 are combined here:}}$\cr
09⊗⊗JGE⊗1F⊗Jump if $e↓v ≥ e↓u$.\cr
10⊗⊗STX⊗FU(0:4)⊗$\hjust{\tt FU} ← \pm\,f\,f\,f\,f\,0$.\cr
11⊗⊗LD2⊗ACC(EXP)⊗$\rI2 ← e↓w$.\cr
12⊗⊗STA⊗FV(0:4)⊗\cr
13⊗⊗LD1N⊗TEMP(EXP)⊗$\rI1 ← -e↓v$.\cr
14⊗⊗JMP⊗4F\cr
\\
15⊗1H⊗STA⊗FU(0:4)⊗$\hjust{\tt FU} ← \pm\,f\,f\,f\,f\,0$ ($u, v$ interchanged).\cr
16⊗⊗LD2⊗TEMP(EXP)⊗$\rI2 ← e↓w$.\cr
17⊗⊗STX⊗FV(0:4)\cr
18⊗⊗LD1N⊗ACC(EXP)⊗$\rI1 ← -e↓v$.\cr
\\
19⊗4H⊗INC1⊗0,2⊗$\rI1 ← e↓u - e↓v$. (Step A4 unnecessary.)\cr
\\
20⊗5H⊗LDA⊗FV⊗\understep{A5. Scale ri}\sl g\understep{ht.}\cr
21⊗⊗ENTX⊗0⊗Clear rX.\cr
22⊗⊗SRAX⊗0,1⊗Shift right $e↓u - e↓v$ places.\cr
\\
23⊗6H⊗ADD⊗FU⊗\understep{A6. Add.}\cr
\\
24⊗⊗JOV⊗N4⊗\understep{A7. Normalize.} Jump if fraction overflow.\cr
25⊗⊗JXZ⊗NORM⊗Easy case?\cr
\\
26⊗⊗CMPA⊗=0=(1:1)⊗Is $f$ normalized?\cr
27⊗⊗JNE⊗N5⊗If so, round it.\cr
28⊗⊗SRC⊗5⊗$|\rX|↔|\rA|$.\cr
29⊗⊗DECX⊗1⊗(rX is positive.)\cr
30⊗⊗STA⊗TEMP⊗(Operands had opposite signs,\cr
31⊗⊗STA⊗HALF(0:0)⊗\quad registers must be adjusted\cr
32⊗⊗LDAN⊗TEMP⊗\quad before rounding and normalization.)\cr
33⊗⊗ADD⊗HALF\cr
34⊗⊗ADD⊗HALF⊗Complement least significant half.\cr
35⊗⊗SRC⊗4⊗Jump into normalization routine.\cr
36⊗⊗JMP⊗N3A\cr
\\
37⊗HALF⊗CON⊗1//2⊗One half the word size (Sign varies)\cr
38⊗FU⊗CON⊗0⊗Fraction part $f↓u$\cr
39⊗FV⊗CON⊗0⊗Fraction part $f↓v$\cr
\noalign{\penalty-300\vskip 3pt plus 6pt minus 2pt}
40⊗NORM⊗JAZ⊗ZRO⊗\understep{N1. Test }$f$\hskip-2pt\understep{\hskip2pt.}\cr
41⊗N2⊗CMPA⊗=0=(1:1)⊗\understep{N2. Is }$f$\hskip-2pt\understep{\hskip2pt\
normalized?}\cr
42⊗⊗JNE⊗N5⊗To N5 if leading byte nonzero.\cr
\\
43⊗N3⊗SLAX⊗1⊗\understep{N3. Scale left.}\cr
44⊗N3A⊗DEC2⊗1⊗Decrease $e$ by 1.\cr
45⊗⊗JMP⊗N2⊗Return to N2.\cr
\\
46⊗N4⊗ENTX⊗1⊗\understep{N4. Scale ri}\sl g\understep{ht.}\cr
47⊗⊗SRC⊗1⊗Shift right, insert ``1'' with proper sign.\cr
48⊗⊗INC2⊗1⊗Increase $e$ by 1.\cr
\\
49⊗N5⊗CMPA⊗=BYTE/2=(5:5)⊗\understep{N5. Round.}\cr
50⊗⊗JL⊗N6⊗Is $|$tail$| < {1\over 2}b$?\cr
51⊗⊗JG⊗5F\cr
52⊗⊗JXNZ⊗5F⊗Is $|$tail$| > {1\over 2}b$?\cr
53⊗⊗STA⊗TEMP⊗$|$tail$| = {1\over 2}b$; round to odd.\cr
54⊗⊗LDX⊗TEMP(4:4)\cr
55⊗⊗JXO⊗N6⊗To N6 if rX is odd.\cr
\\
56⊗5H⊗STA⊗*+1(0:0)⊗Store sign of rA.\cr
57⊗⊗INCA⊗BYTE⊗Add $b↑{-4}$ to $|f|$. (Sign varies)\cr
58⊗⊗JOV⊗N4⊗Check for rounding overflow.\cr
\\
59⊗N6⊗J2N⊗EXPUN⊗\understep{N6. Check $e$.} Underflow if $e < 0$.\cr
\\
60⊗N7⊗ENTX⊗0,2⊗\understep{N7. Pack.} $\rX ← e$.\cr
61⊗⊗SRC⊗1\cr
62⊗ZRO⊗DEC2⊗BYTE⊗$\rI2 ← e - b$.\cr
\\
63⊗8H⊗STA⊗ACC\cr
64⊗EXITF⊗J2N⊗*⊗Exit, unless $e ≥ b$.\cr
65⊗EXPOV⊗HLT⊗2⊗Exponent overflow detected\cr
66⊗EXPUN⊗HLT⊗1⊗Exponent underflow detected\cr
67⊗ACC⊗CON⊗0⊗Floating-point accumulator\quad\blackslug\cr}
%folio 269 galley 3 (C) Addison-Wesley 1978 *
\yyskip The rather long section of code from lines 25 to 37
is needed because \MIX\ has only a 5-byte accumulator for adding
signed numbers while in general $2p + 1 = 9$ places of accuracy
are required by Algorithm A. The program could be shortened
to about half its present length if we were willing to sacrifice
a little bit of its accuracy, but we shall see in the next section
that full accuracy is important. Line 55 uses a nonstandard
\MIX\ instruction defined in Section 4.5.2. The running time for
floating-point addition and subtraction depends on several factors
which are analyzed in Section 4.2.4.
Now let us consider multiplication
and division, which are simpler than addition, and which are
somewhat similar to each other.
\algbegin Algorithm M (Floating-point multiplication
or division). Given base $b$, excess $q$, $p$-digit, normalized
floating-point numbers $u = (e↓u, f↓u)$ and $v = (e↓v, f↓v)$,
this algorithm forms the product $w = u \otimes v$ or the quotient
$w = u \odiv v$.
\algstep M1. [Unpack.] Separate the exponent
and fraction parts of the representations of $u$ and $v$. (Sometimes
it is convenient, but not necessary, to test the operands for
zero during this step.)
\algstep M2. [Operate.] Set
$$\vcenter{\halign{\lft{$#$}\quad⊗\lft{$#$}\quad⊗\lft{#}\cr
e↓w ← e↓u + e↓v - q,⊗f↓w ← f↓uf↓v⊗for multiplication;\cr
\noalign{\vskip 3pt}
e↓w ← e↓u - e↓v + q + 1,⊗f↓w ← (b↑{-1}f↓u)/f↓v⊗for division.\cr}}\eqno(9)$$
(Since the input numbers are assumed to
be normalized, it follows that either $f↓w = 0$, or $1/b↑2 ≤
|f↓w| < 1$, or a division-by-zero error has occurred.) If necessary,
the representation of $f↓w$ may be reduced to $p+2$ or $p+3$
digits at this point, as in exercise 5.
\algstep M3. [Normalize.] Perform Algorithm N on
$(e↓w, f↓w)$ to normalize, round, and pack the result. ({\sl
Note:} Normalization is simpler in this case, since scaling
left occurs at most once, and rounding overflow cannot occur
after division.)\quad\blackslug
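A rough Python counterpart of Algorithm M, in the same $(e, f)$ conventions;
a condensed normalization routine (whose ties go to even rather than to odd)
is repeated here so that the fragment stands on its own, and all names are
merely illustrative.

    # Rough sketch of Algorithm M for base B, excess Q, P-digit numbers.
    from fractions import Fraction

    B, P, Q = 100, 4, 50

    def normalize(e, f):                      # condensed Algorithm N
        if f == 0:
            return (0, Fraction(0))
        while abs(f) >= 1:                    # scale right
            f /= B; e += 1
        while abs(f) < Fraction(1, B):        # scale left
            f *= B; e -= 1
        n = round(f * B**P)                   # N5 (ties go to even here)
        if abs(n) == B**P:                    # rounding overflow: scale right once
            n //= B; e += 1
        return (e, Fraction(n, B**P))

    def fmul(u, v):
        """Step M2 for multiplication, followed by M3."""
        (eu, fu), (ev, fv) = u, v
        return normalize(eu + ev - Q, fu * fv)

    def fdiv(u, v):
        """Step M2 for division, followed by M3."""
        (eu, fu), (ev, fv) = u, v
        if fv == 0:
            raise ZeroDivisionError("floating-point division by zero")
        return normalize(eu - ev + Q + 1, (fu / B) / fv)

    # Example in base 100:  10 divided by 25  is  (50, 2/5), i.e., .40 * 100**0.
    print(fdiv((Q + 1, Fraction(10, 100)), (Q + 1, Fraction(25, 100))))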
\yyskip The following \MIX\ subroutines,
which use the same conventions as Program A above, illustrate
the machine considerations necessary in connection with Algorithm
M.
\algbegin Program M (Floating-point multiplication and division).
\yyskip
\mixfour{01⊗Q⊗EQU⊗BYTE/2⊗$q$ is half the byte size\cr
02⊗FMUL⊗STJ⊗EXITF⊗Floating-point multiplication subroutine:\cr
03⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
04⊗⊗STA⊗TEMP⊗$\hjust{\tt TEMP} ← v$.\cr
05⊗⊗LDX⊗ACC⊗$\rX ← u$.\cr
06⊗⊗STX⊗FU(0:4)⊗$\hjust{\tt FU} ← \pm\,f\,f\,f\,f\,0$.\cr
\\
07⊗⊗LD1⊗TEMP(EXP)\cr
08⊗⊗LD2⊗ACC(EXP)\cr
09⊗⊗INC2⊗-Q,1⊗$\rI2 ← e↓u + e↓v - q$.\cr
\\
10⊗⊗SLA⊗1\cr
11⊗⊗MUL⊗FU⊗Multiply $f↓u$ times $f↓v$.\cr
12⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\cr
\noalign{\penalty-300\vskip 3pt plus 6pt minus 2pt}
13⊗FDIV⊗STJ⊗EXITF⊗Floating-point division subroutine:\cr
14⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
15⊗⊗STA⊗TEMP⊗$\hjust{\tt TEMP} ← v$.\cr
16⊗⊗STA⊗FV(0:4)⊗$\hjust{\tt FV} ← \pm\,f\,f\,f\,f\,0$.\cr
\\
17⊗⊗LD1⊗TEMP(EXP)\cr
18⊗⊗LD2⊗ACC(EXP)\cr
19⊗⊗DEC2⊗-Q,1⊗$\rI2 ← e↓u - e↓v + q$.\cr
\\
20⊗⊗ENTX⊗0\cr
21⊗⊗LDA⊗ACC\cr
22⊗⊗SLA⊗1⊗$\rA ← f↓u$.\cr
23⊗⊗CMPA⊗FV(1:5)\cr
24⊗⊗JL⊗*+3⊗Jump if $|f↓u| < |f↓v|$.\cr
25⊗⊗SRA⊗1⊗Otherwise, scale $f↓u$ right\cr
26⊗⊗INC2⊗1⊗\qquad and increase rI2 by 1.\cr
\\
27⊗⊗DIV⊗FV⊗Divide.\cr
28⊗⊗JNOV⊗NORM⊗Normalize, round, and exit.\cr
29⊗DVZRO⊗HLT⊗3⊗Unnormalized or zero divisor.\quad\blackslug\cr}
\yyskip The most noteworthy feature of this program
is the provision for division in lines 23--26, which is made
in order to ensure enough accuracy to round the answer. If $|f↓u|
< |f↓v|$, straightforward application of Algorithm M would leave
a result of the form ``$\pm\,0\,f\,f\,f\,f$'' in register A, and
this would not allow a proper rounding without a careful analysis
of the remainder (which appears in register X). So the program
computes $f↓w ← f↓u/f↓v$ in this case, ensuring that $f↓w$ is
either zero or normalized in all cases; rounding can proceed
with five significant bytes, possibly testing whether the remainder
is zero.
\yskip We occasionally need to convert values between fixed-
and floating-point representations. A ``fix-to-float'' routine
is easily obtained with the help of the normalization algorithm
above; for example, in \MIX, the following subroutine converts
an integer to floating-point form:
$$\vcenter{\mixfour{01⊗FLOT⊗STJ⊗EXITF⊗Assume that $\rA=u$, an integer.\cr
02⊗⊗JOV⊗OFLO⊗Ensure overflow is off.\cr
03⊗⊗ENT2⊗Q+5⊗Set raw exponent.\cr
04⊗⊗ENTX⊗0\cr
05⊗⊗JMP⊗NORM⊗Normalize, round, and exit.\quad\blackslug\cr}}\eqno(10)$$
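In higher-level terms the same idea can be sketched as follows (Python,
illustrative names; the \MIX\ routine sets the raw exponent to $q + 5$ because
register A holds five bytes, whereas this sketch uses $q + p$ with $p = 4$):

    # Rough sketch: convert an integer to floating-point form by giving it the
    # raw exponent q + p and the raw fraction n * b**-p, then normalizing and
    # rounding as in Algorithm N (ties go to even here, not to odd).
    from fractions import Fraction

    B, P, Q = 100, 4, 50

    def flot(n):
        """Convert the nonnegative integer n to a normalized, rounded (e, f)."""
        e, f = Q + P, Fraction(n, B**P)       # raw exponent and raw fraction
        if f == 0:
            return (0, f)
        while f >= 1:                         # scale right: n exceeds p digits
            f /= B; e += 1
        while f < Fraction(1, B):             # scale left
            f *= B; e -= 1
        m = round(f * B**P)                   # round to p base-b digits
        if m == B**P:                         # rounding overflow
            m //= B; e += 1
        return (e, Fraction(m, B**P))

    print(flot(100))          # (52, Fraction(1, 100)):  .01 * 100**2 = 100
    print(flot(123456789))    # rounds to 123456800 = .01234568 * 100**5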
A ``float-to-fix'' subroutine is the subject of exercise 14.
%folio 273 galley 4 (C) Addison-Wesley 1978 *
\yskip The debugging of floating-point subroutines is usually
a difficult job, since there are so many cases to consider.
Here is a list of common pitfalls which often trap a programmer
or machine designer who is preparing floating-point routines:
\yskip\textindent{1)}{\sl Losing the sign.}\xskip On many machines (not \MIX),
shift instructions between registers will affect the sign, and
the shifting operations used in normalizing and scaling numbers
must be carefully analyzed. The sign is also lost frequently
when minus zero is present. (For example, Program A is careful
to retain the sign of register A in lines 30--34. See also exercise
6.)
\yskip\textindent{2)}{\sl Failure to treat exponent underflow or
overflow properly.}\xskip The size of $e↓w$ should not be checked
until {\sl after} the rounding and normalization, because preliminary
tests may give an erroneous indication. Exponent underflow and
overflow can occur on floating-point addition and subtraction,
not only during multiplication and division; and even though
this is a rather rare occurrence, it must be tested each time.
Enough information should be retained that meaningful corrective
actions are possible after overflow or underflow has occurred.
It has unfortunately become customary in many instances
to ignore exponent underflow and simply to set underflowed results
to zero with no indication of error. This causes a serious loss
of accuracy in most cases (indeed, it is the loss of {\sl all}
the significant digits), and the assumptions underlying floating-point
arithmetic have broken down, so the programmer really must be
told when underflow has occurred. Setting the result to zero
is appropriate only in certain cases when the result is later to be
added to a significantly larger quantity. When exponent underflow
is not detected, we find mysterious situations in which $(u
\otimes v) \otimes w$ is zero, but $u \otimes (v \otimes w)$
is not, since $u \otimes v$ results in exponent underflow but
$u \otimes (v \otimes w)$ can be calculated without any exponents
falling out of range. Similarly, we can find positive numbers
$a$, $b$, $c$, $d$, and $y$ such that
$$\eqalign{(a \otimes y \;\oplus\; b) \odiv (c \otimes y
\;\oplus\; d) ⊗\; \approx\;\textstyle {2\over 3},\cr
\noalign{\vskip 3pt}
(a \;\oplus\; b \odiv y) \odiv (c \;\oplus\;d \odiv y) ⊗\;=\; 1\cr}\eqno(11)$$
if exponent underflow is not detected. (See exercise
9.) Even though floating-point routines are not precisely accurate,
such a disparity as (11) is quite unexpected when $a$, $b$, $c$,
$d$, and $y$ are all {\sl positive}\/! Exponent underflow is usually
not anticipated by a programmer, so he needs to be told about
it.
\yskip\textindent{3)}{\sl Inserted garbage.}\xskip When scaling to the
left it is important to keep from introducing anything but zeros
at the right. For example, note the ``{\tt ENTX 0}'' instruction in
line 21 of Program A, and the all-too-easily-forgotten ``{\tt ENTX
0}'' instruction in line 04 of the {\tt FLOT} subroutine (10). (But
it would be a mistake to clear register X after line 27 in the
division subroutine.)
\yskip\textindent{4)}{\sl Unforeseen rounding overflow.}\xskip When a number
like .999999997 is rounded to 8 digits, a carry will occur to
the left of the decimal point, and the result must be scaled
to the right. Many people have mistakenly concluded that rounding
overflow is impossible during multiplication, since they look
at the maximum value of $|f↓uf↓v|$, which is $1 - 2b↑{-p} + b↑{-2p}$,
and this cannot round up to 1. The fallacy in this reasoning
is exhibited in exercise 11. Curiously, the phenomenon of rounding
overflow {\sl is} impossible during floating-point division (see
exercise 12).
({\sl Note:} There is one school of thought which
says it is harmless to ``round'' .999999997 to .99999999 instead
of to 1.0000000, since this does not increase the worst-case
bounds on relative error. Furthermore the floating-point number
1.0000000 may be said to represent all real values in the interval
$[1.0000000 - 5 \times 10↑{-8}, 1.0000000 + 5 \times 10↑{-8}]$,
while .99999999 represents all values in the much smaller interval
$(.99999999 - 5 \times 10↑{-9}, .99999999 + 5 \times 10↑{-9})$.
Even though the latter interval does not contain the original
value .999999997, each number of the second interval is contained
in the first, so subsequent calculations with the second interval
are no less accurate than with the first. But this argument
is incompatible with the mathematical philosophy of floating-point
arithmetic which is expressed in Section 4.2.2.)
\yskip\textindent{5)}{\sl Rounding before normalizing.}\xskip Inaccuracies
are caused by premature rounding in the wrong digit position.
This error is obvious when rounding is being done to the left
of the appropriate position; but it is also dangerous in the
less obvious cases where rounding is first done too far to the
right and followed by rounding in the true position. For
this reason it is a mistake to round during the ``scaling-right''
operation in step A5, except as prescribed in exercise 5.
(The special case of rounding in step N5, then rounding again
after rounding overflow has occurred, is harmless, however, because rounding
overflow always yields $\pm 1.0000000$ and this is unaffected by the subsequent
rounding process.)
\yskip\textindent{6)}{\sl Failure to retain enough precision in intermediate
calculations.}\xskip Detailed analyses of the accuracy of floating-point
arithmetic, made in the next section, suggest strongly that
normalizing floating-point routines should always deliver a
properly rounded result to the maximum possible accuracy. There
should be no exceptions to this dictum, even in cases which
occur with extremely low probability; the appropriate number
of significant digits should be retained throughout the computations,
as stated in Algorithms A and M.
\subsectionbegin{C. Floating-point hardware} Nearly
every large computer intended for scientific calculations includes
floating-point arithmetic as part of its repertoire of built-in
operations. Unfortunately, the design of such hardware usually
includes some anomalies which result in dismally poor behavior
in certain circumstances, and we hope that future computer designers
will pay more attention to providing the proper behavior than
they have in the past. It costs only a little more to build the
machine right, and considerations in the following section show
that substantial benefits will be gained.
The \MIX\ computer, which is being used as an example
of a ``typical'' machine in this series of books, has an optional
``floating-point attachment'' (available at extra cost) which
includes the following seven operations:
\yyskip\noindent $\bullet$ {\tt FADD}, {\tt FSUB}, {\tt FMUL}, {\tt FDIV},
{\tt FLOT}, {\tt FCMP} (C = 1, 2, 3, 4, 5, 56, respectively; \hjust{F = 6}).
\xskip The contents of rA after the operation ``{\tt FADD V}'' are
precisely the same as the contents of rA after the operations
$$\vjust{\mixtwo{STA⊗ACC\cr LDA⊗V\cr JMP⊗FADD\cr}}$$
where {\tt FADD} is the subroutine which appears earlier
in this section, except that both operands are automatically
normalized before entry to the subroutine if they are not already
in normalized form. (If exponent underflow occurs during this
pre-normalization, but not during the normalization of the answer,
no underflow is signalled.) Similar remarks apply to {\tt FSUB}, {\tt FMUL},
and {\tt FDIV}. The contents of rA after the operation ``{\tt FLOT}'' are
the contents after ``{\tt JMP FLOT}'' in the subroutine (10) above.
The contents of rA are unchanged by the operation
``{\tt FCMP V}''; this instruction sets the comparison indicator to
less, equal, or greater, depending on whether the contents of
rA are ``definitely less than,'' ``approximately equal to,''
or ``definitely greater than'' {\tt V}; this subject is discussed
in the next section, and the precise action is defined by the
subroutine {\tt FCMP} of exercise 4.2.2--17 with {\tt EPSILON} in location
0.
No register other than rA is affected by any of
the floating-point operations. If exponent overflow or underflow
occurs, the overflow toggle is turned on and the exponent of
the answer is given modulo the byte size. Division by zero leaves
undefined garbage in rA. The respective execution times are $4u$,
$4u$, $9u$, $11u$, $3u$, $4u$.
\yskip\noindent $\bullet$ {\tt FIX} (C = 5; F = 7).\xskip The contents of
rA are replaced by the integer ``round(rA)'', rounding to the nearest integer
as in step N5 of Algorithm N.
However, if this answer is too large to fit in the register,
the overflow toggle is set on and the result is undefined. Execution time: $3u$.
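In rough outline (a Python sketch with illustrative names; ties are broken to
even here rather than by the rule of step N5), the {\tt FIX} operation amounts
to the following:

    # Rough sketch of FIX: round the value f * b**(e-q) to the nearest integer
    # and signal overflow when the result does not fit in five b-ary digits.
    from fractions import Fraction

    B, Q, WORD = 100, 50, 100**5              # byte size, excess, register capacity

    def fix(u):
        e, f = u                              # u = (e, f), f an exact Fraction
        n = round(f * Fraction(B) ** (e - Q)) # the represented value, rounded
        if abs(n) >= WORD:
            raise OverflowError("result does not fit in a word")
        return n

    print(fix((52, Fraction(1, 100))))        # .01 * 100**2 = 100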
\yskip Sometimes it is helpful to use floating-point operators
in a nonstandard way. For example, if the operation {\tt FLOT}
had not been included as part of \MIX's floating-point attachment,
we could easily achieve its effect on 4-byte numbers by writing
$$\vcenter{\mixthree{FLOT⊗STJ⊗9F\cr
⊗SLA⊗1\cr
⊗ENTX⊗Q+4\cr
⊗SRC⊗1\cr
⊗FADD⊗=0=\cr
9H⊗JMP⊗*⊗\quad\blackslug\cr}}\eqno(12)$$
This routine is not strictly equivalent to
the {\tt FLOT} operator, since it assumes that the (1:1) byte of rA
is zero, and it destroys rX. The handling of more general situations
is a little tricky, because rounding overflow can occur even
during a {\tt FLOT} operation.
%folio 276 galley 5 (C) Addison-Wesley 1978 *
Similarly, suppose \MIX\ had a {\tt FADD} operation but not {\tt FIX}.
If we wanted to round a number $u$ from floating-point
form to the nearest fixed-point integer, and if we knew that
the number was nonnegative and would fit in at most three bytes, we
could write
$$\hjust{\tt FADD\quad FUDGE}$$
where {\tt FUDGE} contains the constant
$$\def\\{\vrule height 12.4pt depth 7.4pt}
\vcenter{\hjust to 154pt{\hfill\vjust{\baselineskip0pt\lineskip0pt
\hrule
\hjust{\\ \hjust to 25pt{\hfill{\tt+}\hfill\\}\!
\hjust to 25pt{\hfill{\tt Q+4}\hfill\\}\!
\hjust to 25pt{\hfill{\tt1}\hfill\\}\!
\hjust to 25pt{\hfill{\tt0}\hfill\\}\!
\hjust to 25pt{\hfill{\tt0}\hfill\\}\!
\hjust to 25pt{\hfill{\tt0}\hfill\\}}
\hrule}\hfill}};$$
the result in rA would be
$$\def\\{\vrule height 12.4pt depth 7.4pt}
\def\¬{\vrule height 3pt}
\vcenter{\hjust to 154pt{\hfill\vjust{\baselineskip0pt\lineskip0pt
\hrule
\hjust{\\ \vjust{\hjust to 25pt{\hfill{\tt+}\hfill\\}\vfill}\!
\vjust{\hjust to 25pt{\hfill{\tt Q+4}\hfill\\}\vfill}\!
\vjust{\hjust to 25pt{\hfill{\tt1}\hfill\\}\vfill}\!
\vjust{
\hjust to 75pt{\hfill\¬\hfill\¬\hfill\¬}
\hjust to 75pt{\hfill round($u$)\hfill \vrule height 9.4pt depth 4.4pt}
\hjust to 75pt{\hfill\¬\hfill\¬\hfill\¬}}}
\hrule}\hfill}}.\eqno(13)$$
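The arithmetic behind (13) can be checked with a short calculation (a Python
sketch of the fraction arithmetic only, under the illustrative assumptions
$b = 100$, $p = 4$, $q = 50$; the sign and exponent bytes of the packed result
are as shown in (13)):

    # Adding FUDGE = (q+4, +.01 00 00 00) to a nonnegative u < 100**3 aligns u
    # so that the rounding of step N5 falls exactly at u's unit position.
    from fractions import Fraction

    B, P, Q = 100, 4, 50

    u = Fraction(123456) + Fraction(3, 4)     # some value that fits in three bytes
    fudge_f = Fraction(1, B)                  # fraction of FUDGE; its exponent is q+4
    raw = fudge_f + u / B**P                  # steps A5 and A6, with e_w = q + 4
    n = round(raw * B**P)                     # step N5: round to p = 4 byte positions
    print(n - B**(P - 1))                     # low three bytes hold round(u) = 123457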
Some machines have floating-point
hardware which suppresses automatic rounding and gives additional
information about the less-significant digits that would ordinarily
be rounded off. Such facilities are of use in extending the
precision of floating-point calculations, but their implementation
is usually subject to unfortunate anomalies, and unrounded results
are generally undesirable in single-precision calculations.
\subsectionbegin{D. History and bibliography} The origins
of floating-point notation can be traced back to Babylonian mathematicians
(1800 {\:m B.C.} or earlier), who made extensive
use of radix-60 floating-point arithmetic but did not have a
notation for the exponents. The appropriate exponent was always
somehow ``understood'' by the man doing the calculations. At
least one case has been found in which the wrong answer was
given because addition was performed with improper alignment
of the operands, but such examples are very rare; see O. Neugebauer,
{\sl The Exact Sciences in Antiquity} (Princeton, N. J.: Princeton University
Press, 1952), 26--27. Another early contribution to floating-point
notation is due to the Greek mathematician Apollonius (3rd century
{\:m B.C.}), who apparently was
the first to explain how to simplify multiplication by collecting
powers of 10 separately from their coefficients, at least in
simple cases. [For a discussion of Apollonius's method, see
Pappus, {\sl Mathematical Collections} (4th century {\:m A.D.}).]
After the Babylonian civilization
died out, the first significant uses of floating-point notation
for products and quotients did not emerge until much later,
about the time logarithms were invented (1600) and shortly afterwards
when Oughtred invented the slide rule (1630). The modern notation
``$\,x↑n\,$'' for exponents was being introduced at about the same
time; separate symbols for $x$ squared, $x$ cubed, etc.\ had
been in use before this.
Floating-point arithmetic was incorporated into
the design of some of the earliest computers. It was independently
proposed by Leonardo Torres y Quevedo in Madrid, 1914; by Konrad
Zuse in Berlin, 1936; and by George Stibitz in New Jersey, 1939.
Zuse's machines used a floating-binary representation which
he called ``semi-logarithmic notation''; he also incorporated
conventions for dealing with special quantities like ``$∞$'' and
``undefined.'' The first American computers to operate with
floating-point arithmetic hardware were the Bell Laboratories'
Model V and the Harvard Mark II, both of which were relay calculators
designed in 1944. [See B. Randell, {\sl The Origins of Digital
Computers} (Berlin: Springer, 1973), pp.\ 100, 155, 163--164,
259--260; {\sl Proc.\ Symp.\ on Large-Scale Digital Calculating
Machinery} (Harvard, 1947), 41--68, 69--79; {\sl Datamation
\bf 13} (April 1967), 35--44, (May 1967), 45--49; {\sl Zeitschrift
f\"ur angew.\ Math.\ und Physik \bf 1} (1950),
345--346.]
The use of floating-binary arithmetic
was seriously considered in 1944--1946 by researchers at the
Moore School in their plans for the first {\sl electronic} digital
computers, but it turned out to be much harder to implement
floating-point circuitry with tubes than with relays. The group
realized that scaling is a problem in programming, but felt
that it is only a very small part of a total programming job
which is usually worth the time and trouble it takes, since
it tends to keep a programmer aware of the numerical accuracy
he is getting. Furthermore, they argued that floating-point
representation takes up valuable memory space, since the exponents
must be stored, and that it becomes difficult to adapt floating-point
arithmetic to multiple-precision calculations. [See von Neumann's
{\sl Collected Works} {\bf 5} (New York: Macmillan, 1963), 43,
73--74.] At this time, of course, they were designing the first
stored-program computer and the second electronic computer,
and their choice had to be either fixed-point {\sl or} floating-point
arithmetic, not both. They anticipated the coding of floating-binary
routines, and in fact ``shift-left'' and ``shift-right'' instructions
were put into their machine primarily to make such routines
more efficient. The first machine to have both kinds of arithmetic
in its hardware was apparently a computer developed at General
Electric Company [see
{\sl Proc.\ 2nd Symp.\ on Large-Scale
Digital Calculating Machinery} (Cambridge: Harvard University
Press, 1951), 65--69].
Floating-point subroutines and interpretive systems
for early machines were coded by D. J. Wheeler and others, and
the first publication of such routines was in {\sl The Preparation
of Programs for an Electronic Digital Computer} by Wilkes, Wheeler,
and Gill (Reading, Mass.: Addison-Wesley, 1951), subroutines
A1--A11, pp.\ 35--37, 105--117. It is interesting to note that
floating-{\sl decimal} subroutines are described here, although
a binary computer was being used; in other words, the numbers
were represented as $10↑ef$, not $2↑ef$, and therefore the scaling
operations required multiplication or division by 10. On this
particular machine such decimal scaling was about as easy as
shifting, and the decimal approach greatly simplified input-output
conversions.
Most published references to the details of floating-point
arithmetic routines are scattered in ``technical memorandums''
distributed by various computer manufacturers, but there have
been occasional appearances of these routines in the open literature.
Besides the reference above, see R. H. Stark and D. B. MacMillan,
{\sl Math.\ Comp.\ \bf 5} (1951), 86--92, where a plugboard-wired
program is described; D. McCracken, {\sl Digital Computer Programming}
(New York: Wiley, 1957), 121--131; J. W. Carr III, {\sl CACM
\bf 2} (May 1959), 10--15; W. G. Wadey, {\sl JACM \bf 7} (1960),
129--139; D. E. Knuth, {\sl JACM \bf 8} (1961), 119--128; O.
Kesner, {\sl CACM \bf 5} (1962), 269--271; F. P. Brooks and
K. E. Iverson, {\sl Automatic Data Processing} (New York: Wiley,
1963), 184--199. For a discussion of floating-point arithmetic
from a computer designer's standpoint, see ``Floating-point
operation'' by S. G. Campbell, in {\sl Planning a Computer System},
ed.\ by W. Buchholz (New York: McGraw-Hill, 1962), 92--121. Additional
references are given in Section 4.2.2.
\exbegin{EXERCISES}
\exno 1. [10] How would
Avogadro's number and Planck's constant be represented in base
100, excess 50, four-digit floating-point notation? $\biglp$This would
be the representation used by \MIX, as in (4), if the byte size is 100.$\bigrp$
%folio 282 galley 6 (C) Addison-Wesley 1978 *
\exno 2. [12] Assume that the exponent
$e$ is constrained to lie in the range $0 ≤ e ≤ E$; what are
the largest and smallest positive values that can be written
as base $b$, excess $q$, $p$-digit floating-point numbers? What
are the largest and smallest positive values that can be written
as {\sl normalized} floating-point numbers with these specifications?
\exno 3. [11] (K. Zuse, 1936.)\xskip Show that
if we are using normalized floating-binary
arithmetic, there is a way to increase the precision slightly
without loss of memory space: A $p$-bit fraction part can be
represented using only $p - 1$ bit positions of a computer word,
if the range of exponent values is decreased very slightly.
\trexno 4. [12] Assume that $b = 10$, $p = 8$. What result does
Algorithm A give for $(50, +.98765432) \oplus (49, +.33333333)$?
For $(53, -.99987654) \oplus (54, +.10000000)$? For $(45, -.50000001)
\oplus (54, +.10000000)$?
\trexno 5. [24] Let us say that $x \sim y$ (with respect to a given
radix $b$) if $x$ and $y$ are real numbers satisfying the following
conditions:
$$\eqalign{\lfloor x/b\rfloor ⊗= \lfloor y/b\rfloor;\cr
\noalign{\vskip 2pt}
x\mod b = 0\qquad⊗\hjust{iff}\qquad y \mod b = 0;\cr
\noalign{\vskip 2pt}
\textstyle 0 < x\mod b < {1\over 2}b\qquad⊗\hjust{iff}\qquad
\textstyle 0 < y \mod b < {1\over 2}b;\cr
\noalign{\vskip 2pt}
\textstyle x \mod b = {1\over 2}b\qquad⊗\hjust{iff}\qquad
\textstyle y \mod b = {1\over 2}b;\cr
\noalign{\vskip 2pt}
\textstyle {1\over 2}b < x \mod b < b\qquad⊗\hjust{iff}\qquad
\textstyle {1\over 2}b < y \mod b < b.\cr}$$
Prove that if $f↓v$ is replaced by $b↑{-p-2}F↓v$
between steps A5 and A6 of Algorithm A, where $F↓v \sim b↑{p+2}f↓v$,
the result of that algorithm will be unchanged. (If $F↓v$ is
an integer and $b$ is even, this operation essentially truncates
$f↓v$ to $p + 2$ places while remembering whether any nonzero
digits have been dropped, thereby minimizing the length of register
which is needed for the addition in step A6.)
\exno 6. [20] If the result of a {\tt FADD} instruction is zero, what
will be the sign of rA, according to the definitions of \MIX's
floating-point attachment given in this section?
\exno 7. [27] Discuss floating-point arithmetic
using balanced ternary notation.
\exno 8. [20] Give examples of normalized eight-digit floating-decimal
numbers $u$ and $v$ for which addition yields\xskip (a) exponent underflow,\xskip
(b) exponent overflow, assuming that exponents must satisfy
$0 ≤ e < 100$.
\exno 9. [M24] (W. M. Kahan.)\xskip Assume
that the occurrence of exponent underflow causes the result
to be replaced by zero, with no error indication given. Using
excess zero, eight-digit floating-decimal numbers with $e$ in
the range $-50 ≤ e < 50$, find positive values of $a$, $b$,
$c$, $d$, and $y$ such that (11) holds.
\exno 10. [12] Give an example of normalized eight-digit floating-decimal
numbers $u$ and $v$ for which rounding overflow occurs in addition.
\trexno 11. [M20] Give an example of normalized, excess 50, eight-digit
floating-decimal numbers $u$ and $v$ for which rounding overflow
occurs in multiplication.
\exno 12. [M25] Prove that rounding overflow cannot occur during
the normalization phase of floating-point division.
{\def\upper#1{\hjust to 8.889pt{\:u\char'164\hskip-8.889pt\hfill
$\vcenter{\hjust{$\scriptscriptstyle#1$}}$\hfill}}
\def\lower#1{\hjust to 8.889pt{\:u\char'165\hskip-8.889pt\hfill
$\vcenter{\hjust{$\scriptscriptstyle#1$}}$\hfill}}
\exno 13. [30] When doing ``interval arithmetic'' we don't want
to round the results of a floating-point computation; we want
rather to implement operations such as $\lower+$ and $\upper+$ which
give the tightest possible representable bounds on the true
sum: $$u \lower+ v \;≤\; u + v \;≤\; u \upper+ v.$$ How should the algorithms
of this section be modified for such a purpose?}
\exno 14. [25] Write a \MIX\ subroutine which begins with an arbitrary
floating-point number in register A, not necessarily normalized,
and which converts it to the nearest fixed-point integer (or
determines that it is too large in absolute value to make such
a conversion possible).
\trexno 15. [28] Write a \MIX\ subroutine, to be used in connection
with the other subroutines of this section, that calculates
$u\;\mod\;1$, that is, $u - \lfloor u\rfloor$ rounded to the nearest
floating-point number, given a floating-point number $u$. Note
that when $u$ is a very small negative number, $u\;\mod\;1$ will
be rounded so that the result is unity (even though $u \mod
1$ has been defined to be always {\sl less} than unity,
as a real number).
\exno 16. [HM21] (Robert L. Smith.)\xskip Design an algorithm to compute
the real and imaginary parts of the complex number $(a + bi)/(c
+ di)$, given real floating-point values $a$, $b$, $c$, and $d$.
Avoid the computation of $c↑2 + d↑2$, since it would cause floating-point
overflow even when $|c|$ or $|d|$ is approximately the
square root of the maximum allowable floating-point value.
\exno 17. [40] (John Cocke.)\xskip Explore the idea of extending the
range of floating-point numbers by defining a single-word representation
in which the precision of the fraction decreases as the magnitude
of the exponent increases.
\exno 18. [25] Consider a binary computer with 36-bit words,
on which positive floating-binary numbers are represented as
$(0e↓1e↓2 \ldotsm e↓8f↓1f↓2 \ldotsm f↓{27})↓2$; here $(e↓1e↓2
\ldotsm e↓8)↓2$ is an excess $(10000000)↓2$ exponent and $(f↓1f↓2
\ldotsm f↓{27})↓2$ is a 27-bit fraction. Negative floating-point
numbers are represented by the {\sl two's complement} of the corresponding
positive representation (see Section 4.1). Thus, 1.5 is {\it 201$\,|$600000000}
in octal notation, while $-1.5$ is {\it 576$\,|$200000000\/}; the
octal representations of 1.0 and $-1.0$ are {\it 201$\,|$400000000} and
{\it 576$\,|$400000000}, respectively. (A vertical line is used here to show
the boundary between exponent and fraction.) Note that bit $f↓1$ of a
normalized positive number is always 1, while it is almost always
zero for negative numbers; the exceptional cases are representations
of $-2↑k$.
Suppose that the exact result of a floating-point
operation has the octal code {\it 572$\,|$740000000$\,|$01\/};
this (negative) 33-bit fraction must be normalized and rounded
to 27 bits. If we shift left until the leading fraction bit
is zero, we get {\it 576$\,|$000000000$\,|$20}, but this
rounds to the illegal value {\it 576$\,|$000000000\/}; we
have over-normalized, since the correct answer is {\it 575$\,|$400000000}.
On the other hand if we start (in some other problem) with the value
{\it 572$\,|$740000000$\,|$05} and stop before over-normalizing
it, we get {\it 575$\,|$400000000$\,|$50} which rounds
to the unnormalized number {\it 575$\,|$400000001\/}; subsequent
normalization yields {\it 576$\,|$000000002} while the correct
answer is {\it 576$\,|$000000001}.
Give a simple, correct rounding rule which resolves
this dilemma on such a machine (without abandoning two's complement
notation).
\exno 19. [24] What is the running time for the {\tt FADD} subroutine
in Program A, in terms of relevant characteristics of the data?
What is the maximum running time, over all inputs that do not
cause overflow or underflow?